Context Rot: why more context can make your model worse.
Bigger context windows feel like free memory. In practice, model accuracy decays as the input grows, even on trivial tasks and long before the window is full. Here's the evidence, the mechanisms, and what a forward-deployed engineer (FDE) should do about it.
Primary source: Hong, Troynikov & Huber, “Context Rot: How Increasing Input Tokens Impacts LLM Performance,” Chroma (July 2025). Charts below are original visualizations built for this lesson, recreating the trends reported in the cited research.
The uniform-context myth
We tend to assume a model treats its 10,000th token as reliably as its 100th. Chroma's evaluation of 18 frontier models — including GPT-4.1, Claude 4, Gemini 2.5 and Qwen3 — shows that assumption is false: models do not use their context uniformly, and reliability erodes as input length grows, even when the task itself stays trivially simple.
The popular Needle-in-a-Haystack test made long context look solved — but it only measures literal keyword retrieval. Once you require semantic matching, add distractors, or scale output alongside input, performance slips in surprising, non-uniform ways.
Lost in the middle
The earliest and most cited symptom: position matters. Liu et al. (Stanford, 2023) found accuracy traces a U-shape — models recall facts placed at the very start or end of the context far more reliably than facts buried in the middle, a gap that can exceed 30 percentage points.
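To check the U-shape on your own model, you can sweep a known fact through the context and score each position separately. The sketch below only builds the prompts; the filler text, the needle, and the function names `build_prompt` and `depth_sweep` are illustrative assumptions, not the setup used by Liu et al. or Chroma.

```python
def build_prompt(filler_sentences, needle, depth, question):
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler_sentences))
    haystack = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(haystack) + "\n\nQuestion: " + question


def depth_sweep(filler_sentences, needle, question, steps=5):
    """Return (depth, prompt) pairs covering start, middle, and end positions."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    depths = [i / (steps - 1) for i in range(steps)]
    return [(d, build_prompt(filler_sentences, needle, d, question)) for d in depths]


# Hypothetical filler and needle, for illustration only.
filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The access code for the archive is 7391."
sweep = depth_sweep(filler, needle, "What is the access code for the archive?")

for depth, prompt in sweep:
    # Character offset of the needle grows with depth.
    print(depth, prompt.index(needle))
```

Send each prompt to the model under test and plot accuracy against depth; a dip in the middle depths is the lost-in-the-middle pattern.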
A likely architectural cause is the long-range decay built into rotary position embeddings: distant token pairs get systematically lower attention, and softmax sharpens the bias toward the start and end of the sequence.
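The decay can be seen numerically in a toy setting. The sketch below assumes a head dimension of 128, the base of 10000 from the original rotary-embedding formulation, and identical all-ones query and key vectors, so that only the positional term varies. It is not a measurement of any trained model, whose learned queries and keys can behave differently.

```python
import math


def rope_score(distance, dim=128, base=10000.0):
    """Pre-softmax score between two all-ones vectors under rotary embeddings.

    With q = k = (1, ..., 1), the rotated dot product depends only on the
    relative distance: each frequency pair contributes 2 * cos(distance * theta_i),
    where theta_i = base ** (-2 * i / dim).
    """
    half = dim // 2
    return sum(2.0 * math.cos(distance * base ** (-2.0 * i / dim)) for i in range(half))


def mean_score(start, width=16):
    """Average the score over a small window of distances to smooth the oscillation."""
    return sum(rope_score(start + j) for j in range(width)) / width


same_position = rope_score(0)   # every cosine is 1, so this equals dim
near = mean_score(1)            # distances 1..16
far = mean_score(1024)          # distances 1024..1039

print(f"distance 0: {same_position:.1f}")
print(f"near mean:  {near:.1f}")
print(f"far mean:   {far:.1f}")
```

The score oscillates rather than falling monotonically, which is why the comparison uses window averages; the trend across windows is what matters.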
It isn't just position — it's similarity
Real queries rarely share exact keywords with the answer. When Chroma varied the semantic similarity between the question and the “needle,” low-similarity pairs degraded much faster as input grew — the model has to infer relevance instead of pattern-matching it. NoLiMa (Modarressi et al., 2025) reports the same: drop literal overlap and long-context scores collapse.
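A quick way to see how much a test pair leans on pattern-matching is to measure its literal word overlap. The sketch below uses Jaccard overlap on content words as a crude proxy; the stopword list and the example sentences are illustrative assumptions, and the check says nothing about semantic similarity itself, which would need an embedding model.

```python
import re

# Small illustrative stopword list; an assumption, not a standard resource.
STOPWORDS = {"the", "a", "an", "to", "of", "in", "is", "has", "been", "which", "what", "who"}


def content_words(text):
    """Lowercased alphanumeric words with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def lexical_overlap(question, needle):
    """Jaccard overlap of content words between the question and the needle."""
    q, n = content_words(question), content_words(needle)
    if not q or not n:
        return 0.0
    return len(q & n) / len(q | n)


question = "Which character has been to Dresden?"
literal_needle = "Yuki has been to Dresden."
latent_needle = "Yuki lives next to the Semper Opera House."  # answerable only with world knowledge

literal_score = lexical_overlap(question, literal_needle)
latent_score = lexical_overlap(question, latent_needle)
print(literal_score, latent_score)
```

Pairs with zero overlap, like the second one, are the ones to watch as input length grows: the model cannot find them by matching keywords.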
Distractors compound the damage
Add text that's topically related but doesn't answer the question, and accuracy drops further. Chroma found a single distractor already hurts, and four compound the effect — amplified at longer inputs. Notably, model families differ: Claude tends to abstain under ambiguity, while GPT models more often hallucinate a confident wrong answer. This echoes Shi et al. (2023), “LLMs Can Be Easily Distracted by Irrelevant Context.”
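When you run a distractor test yourself, it helps to separate the failure modes rather than report a single accuracy number. The sketch below sorts responses into correct, distracted, abstained, and other wrong answers by substring matching; the marker phrases, labels, and sample responses are hypothetical, and a real evaluation would likely need a more careful grader.

```python
from collections import Counter

# Hypothetical abstention phrases; extend these for the model under test.
ABSTAIN_MARKERS = (
    "i don't know",
    "cannot determine",
    "not enough information",
    "unable to find",
)


def classify_response(response, gold_answer, distractor_answers):
    """Label one response. A response containing the gold answer counts as correct
    even if it also mentions a distractor."""
    text = response.lower()
    if gold_answer.lower() in text:
        return "correct"
    if any(d.lower() in text for d in distractor_answers):
        return "distracted"
    if any(marker in text for marker in ABSTAIN_MARKERS):
        return "abstained"
    return "other_wrong"


gold = "7391"
distractors = ["4820", "1156"]
responses = [
    "The code is 7391.",
    "I believe the code is 4820.",
    "There is not enough information to answer.",
    "The code is 9999.",
]

labels = [classify_response(r, gold, distractors) for r in responses]
tally = Counter(labels)
print(tally)
```

Tracking the "distracted" and "abstained" counts separately is what makes family-level differences like the one above visible.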
Retrieval + reasoning: focused beats full
Dumping a whole chat history into the prompt forces the model to do two jobs at once — find the relevant parts and reason over them. On LongMemEval (Wu et al., 2025), Chroma compared a ~113K-token “full” prompt against a ~300-token “focused” prompt containing only the relevant turns. Every family scored far higher on the focused input.
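The focused condition can be approximated by filtering the history before it reaches the model. The sketch below is a minimal stand-in that ranks turns by keyword overlap with the question; the function names, stopword list, and sample history are assumptions for illustration, and this is not how Chroma built its focused prompts. A production system would more likely use embedding-based retrieval, since keyword overlap misses low-similarity turns.

```python
import re

# Small illustrative stopword list; an assumption, not a standard resource.
STOPWORDS = {"what", "is", "the", "of", "my", "a", "on", "i", "you", "can", "for"}


def keywords(text):
    """Lowercased alphanumeric words with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def select_focused_turns(history, question, max_turns=3):
    """Keep up to `max_turns` turns that share keywords with the question,
    preferring higher overlap and returning them in their original order."""
    q = keywords(question)
    scored = [(len(q & keywords(turn)), i) for i, turn in enumerate(history)]
    relevant = [(score, i) for score, i in scored if score > 0]
    best = sorted(relevant, key=lambda pair: (-pair[0], pair[1]))[:max_turns]
    return [history[i] for i in sorted(i for _, i in best)]


def build_prompt(turns, question):
    return "\n".join(turns) + "\n\nQuestion: " + question


history = [
    "user: I adopted a beagle named Pixel last spring.",
    "assistant: Congratulations on the new dog!",
    "user: Can you suggest a pasta recipe for tonight?",
    "assistant: Try a simple cacio e pepe.",
    "user: Pixel has a vet appointment on Friday.",
]
question = "What is the name of my beagle?"

focused_turns = select_focused_turns(history, question)
full_prompt = build_prompt(history, question)
focused_prompt = build_prompt(focused_turns, question)
print(focused_prompt)
```

The model then has one job, reasoning over the relevant turns, instead of two.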
Structure, absence, and hard thresholds
Three more findings round out the picture. Counter-intuitively, Chroma found that models score better on a randomly shuffled haystack than on a logically coherent one: structure changes how attention is spent. Separately, AbsenceBench (Fu et al., 2025) shows that models struggle to notice what is missing from a long input. And a recent threshold analysis (“Intelligence Degradation in Long-Context LLMs,” 2026) reports that some models collapse abruptly, with an F1 drop of more than 40%, once a critical fraction of the window is crossed, rather than degrading smoothly.
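If a model has a cliff like this, it is worth guarding against it before a request is sent. The sketch below flags prompts that exceed a chosen fraction of the window. The four-characters-per-token estimate is a rough assumption rather than a real tokenizer, and the 0.5 threshold in the example is an arbitrary placeholder: the critical fraction has to be measured for each model.

```python
def estimate_tokens(text):
    """Very rough token estimate (about 4 characters per token).
    An assumption for illustration; use the model's tokenizer in practice."""
    return max(1, len(text) // 4)


def window_fraction(prompt, window_tokens):
    """Fraction of the context window the prompt is estimated to occupy."""
    return estimate_tokens(prompt) / window_tokens


def within_budget(prompt, window_tokens, critical_fraction):
    """True when the prompt stays below the critical fraction of the window."""
    if not 0.0 < critical_fraction <= 1.0:
        raise ValueError("critical_fraction must be in (0, 1]")
    return window_fraction(prompt, window_tokens) < critical_fraction


# Placeholder numbers: a 2,000-token window and a 0.5 critical fraction.
short_prompt = "word " * 100    # 500 characters, about 125 tokens
long_prompt = "word " * 1000    # 5,000 characters, about 1,250 tokens

print(within_budget(short_prompt, window_tokens=2000, critical_fraction=0.5))
print(within_budget(long_prompt, window_tokens=2000, critical_fraction=0.5))
```

A prompt that fails the check is a candidate for trimming or retrieval rather than a larger window.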
“Whether the answer is in the context isn't what matters most — what matters is how that information is presented.”
— Chroma, Context Rot (2025)
What this means for an FDE
The fix isn't a bigger window — it's context engineering: deliberately curating what enters the prompt.
References & sources
- Hong, K., Troynikov, A., & Huber, J. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma. trychroma.com/research/context-rot
- Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172
- Modarressi, A., et al. (2025). NoLiMa: Long-Context Evaluation Beyond Literal Matching. arXiv:2502.05167
- Wu, D., et al. (2025). LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. arXiv:2410.10813
- Shi, F., et al. (2023). Large Language Models Can Be Easily Distracted by Irrelevant Context. arXiv:2302.00093
- Fu, H. Y., et al. (2025). AbsenceBench: Language Models Can't Tell What's Missing. arXiv:2506.11440
- Hsieh, C. P., et al. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models? arXiv:2404.06654
- Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination (2026). arXiv:2601.15300